
    An Unsupervised Algorithm for Segmenting Categorical Timeseries into Episodes

    This paper describes an unsupervised algorithm for segmenting categorical time series into episodes. The Voting-Experts algorithm first collects statistics about the frequency and boundary entropy of n-grams, then passes a window over the series and has two “expert methods” decide where boundaries should be drawn within the window. The algorithm successfully segments text into words in four languages. It also segments time series of robot sensor data into subsequences that represent episodes in the life of the robot. We claim that Voting-Experts finds meaningful episodes in categorical time series because it exploits two statistical characteristics of meaningful episodes.
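
    As a rough illustration of the idea described above, the sketch below implements a simplified Voting-Experts-style pass in Python: it gathers n-gram frequency and boundary-entropy statistics, slides a window over the sequence, and lets a frequency expert and an entropy expert each vote for one cut point per window. It omits details of the published algorithm (such as score normalization), and the function names are ours, not the authors'.

```python
# Simplified sketch of a Voting-Experts-style segmentation pass (illustrative only).
from collections import Counter, defaultdict
from math import log2

def ngram_stats(seq, max_n):
    """Collect frequency and successor counts for all n-grams up to length max_n."""
    freq, successors = Counter(), defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            freq[gram] += 1
            if i + n < len(seq):
                successors[gram][seq[i + n]] += 1
    return freq, successors

def boundary_entropy(gram, successors):
    """Entropy of the symbol following `gram`; high entropy suggests a boundary."""
    counts = successors[gram]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts.values())

def vote(seq, window=7):
    """Slide a window over `seq`; two experts each vote for one cut per window."""
    freq, successors = ngram_stats(seq, window)
    votes = [0] * (len(seq) + 1)
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        # Frequency expert: favour the cut that leaves two frequent chunks.
        best_f = max(range(1, window),
                     key=lambda k: freq[tuple(w[:k])] + freq[tuple(w[k:])])
        # Boundary-entropy expert: favour the cut after a high-entropy prefix.
        best_e = max(range(1, window),
                     key=lambda k: boundary_entropy(tuple(w[:k]), successors))
        votes[start + best_f] += 1
        votes[start + best_e] += 1
    return votes  # peaks above a chosen threshold are taken as episode boundaries
```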

    Tag based models of English text

    The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the part of speech of each word in the text (referred to as “tags”). It is shown that the tags plus the text can be compressed more than the text alone: essentially, the tags can be compressed for nothing, or even with a small net saving in size. A number of different ways of integrating the compression of tags and text using an escape mechanism similar to PPM are compared, along with standard word-based and character-based compression programs. The result is that the tag-character and word-based schemes always outperform the character-based schemes, and overall the tag-based schemes outperform the word-based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.
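
    A toy unigram illustration (not the paper's PPM scheme) of why tags can come essentially for free: when the tags are largely predictable from the words, the cost of coding the tag stream and then the words given the tags is close to the cost of coding the words alone. The corpus and tag set below are invented for the example.

```python
# Toy comparison of coding costs: words alone versus tags-then-words (illustrative only).
from collections import Counter, defaultdict
from math import log2

def bits(count, total):
    """Ideal code length, in bits, for an event with empirical probability count/total."""
    return -log2(count / total)

def compare_costs(tagged_corpus):
    """Return (word-only, tag-then-word) coding costs in bits for a list of (word, tag) pairs."""
    words = Counter(w for w, _ in tagged_corpus)
    tags = Counter(t for _, t in tagged_corpus)
    word_given_tag = defaultdict(Counter)
    for w, t in tagged_corpus:
        word_given_tag[t][w] += 1

    n = len(tagged_corpus)
    word_only = sum(bits(words[w], n) for w, _ in tagged_corpus)
    tag_then_word = sum(bits(tags[t], n) + bits(word_given_tag[t][w], tags[t])
                        for w, t in tagged_corpus)
    return word_only, tag_then_word

corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
print(compare_costs(corpus))
# Both costs are equal here: each word has a unique tag, so coding the tags and
# the words together costs no more than coding the words alone.
```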

    Grammatical Herding

    No full text

    Correcting English text using PPM models

    An essential component of many applications in natural language processing is a language model able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized, while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or omitted. This paper describes a method for correcting English text using a PPM model. A method for segmenting words in English text is introduced and shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 95.9% to 96.6%, a reduction of about 10 errors per page.
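
    The sketch below shows the general shape of language-model-based correction, with a character bigram model standing in for PPM and a single-edit candidate generator standing in for the paper's error model; none of it reproduces the authors' implementation.

```python
# Sketch of model-based correction: score single-edit variants of a noisy word
# with a character bigram model (a crude stand-in for PPM) and keep the cheapest.
from collections import Counter, defaultdict
from math import log2

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train_bigrams(text):
    """Character bigram and unigram counts over training text."""
    bigrams, unigrams = defaultdict(Counter), Counter(text)
    for a, b in zip(text, text[1:]):
        bigrams[a][b] += 1
    return bigrams, unigrams

def cost(word, bigrams, unigrams):
    """Negative log-probability of `word` under the bigram model, with add-one smoothing."""
    total = 0.0
    for a, b in zip(word, word[1:]):
        total -= log2((bigrams[a][b] + 1) / (unigrams[a] + len(ALPHABET)))
    return total

def candidates(word):
    """Single-edit variants: deletions, transpositions, substitutions, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    out = {word}
    out |= {l + r[1:] for l, r in splits if r}                          # deletion
    out |= {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}   # transposition
    out |= {l + c + r[1:] for l, r in splits if r for c in ALPHABET}    # substitution
    out |= {l + c + r for l, r in splits for c in ALPHABET}             # insertion
    return out

def correct(word, bigrams, unigrams):
    """Return the candidate the character model finds most probable."""
    return min(candidates(word), key=lambda w: cost(w, bigrams, unigrams))
```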

    Adaptive models of English text

    High-quality models of English text with performance approaching that of humans are important for many applications, including spelling correction, speech recognition, OCR, and encryption. A number of different statistical models of English are compared with each other and with previous estimates from human subjects. It is concluded that the best current models are word-based with part-of-speech tags; given sufficient training text, they are able to attain performance comparable to that of humans.

    Constituent Grammatical Evolution

    No full text

    Universal text preprocessing and postprocessing for PPM using Alphabet Adjustment

    No full text
    In this paper, we introduce several new universal preprocessing techniques to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially 'adjust' the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text.
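
    One plausible form of 'alphabet expansion' for UTF-8 text is sketched below: each distinct character is remapped to a single integer symbol before modelling, and the mapping is inverted after decompression. This is an assumption about what such an adjustment might look like, not a reproduction of the paper's transforms.

```python
# Sketch of an "alphabet expansion" preprocessing step for UTF-8 text (illustrative only).
# Multi-byte UTF-8 characters are remapped to single integer symbols so a
# symbol-oriented model such as PPM sees one symbol per character.

def expand_alphabet(text):
    """Map each distinct character to a small integer symbol."""
    symbol_of = {}
    symbols = []
    for ch in text:
        if ch not in symbol_of:
            symbol_of[ch] = len(symbol_of)
        symbols.append(symbol_of[ch])
    # The table must accompany the compressed data so the decoder can invert the mapping.
    char_of = {s: c for c, s in symbol_of.items()}
    return symbols, char_of

def restore_alphabet(symbols, char_of):
    """Inverse transform applied after decompression."""
    return "".join(char_of[s] for s in symbols)

text = "çağdaş ölçüm"              # characters that take several bytes in UTF-8
symbols, table = expand_alphabet(text)
assert restore_alphabet(symbols, table) == text
```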

    A 300 MB Turkish Corpus and Word Analysis

    No full text
    In order to determine some properties of a language, a corpus of that language must first be created. To analyze Turkish, a corpus of roughly 300 MB containing more than 44 million words was compiled from 10 different web sites with Turkish content. Statistics on the most frequently used Turkish words were calculated from this corpus. The frequencies of the seven most frequently used words were compared with their English equivalents, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of one to five letters were then determined and applied to a randomly selected text in order to test the validity of the process.
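
    The word-frequency analysis described above amounts to tokenizing the corpus and counting tokens; a minimal sketch follows (the file name and tokenization rule are illustrative, and Turkish-specific case folding is ignored).

```python
# Minimal sketch of the corpus word-frequency analysis described above
# (file name and tokenization are illustrative, not taken from the paper).
import re
from collections import Counter

def word_frequencies(path, encoding="utf-8"):
    """Count word tokens in a corpus file, streaming line by line."""
    counts = Counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            # Simple lowercasing; proper Turkish case folding (dotted/dotless i) is ignored.
            counts.update(re.findall(r"\w+", line.lower()))
    return counts

freqs = word_frequencies("turkish_corpus.txt")   # hypothetical file name
for word, count in freqs.most_common(7):         # the seven most frequent words
    print(word, count)
```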